Conversation

@2010YOUY01
Contributor

Follow up to #560

Rationale

read_parquet() now supports an override option, geometry_columns, to cast Binary columns to geometries. This is an unsafe operation, so adding a validation step is helpful.

This PR currently limits validation to WKB validity. In the long term, additional metadata properties can also be validated. If any entry fails validation, an error will be returned.

Geo columns are inferred from either (Geo)Parquet metadata or the user-provided geometry_columns override. The validate option applies uniformly in both cases: if validate = true, geometry columns are always validated. I think this approach is more general and simpler.

Implementation

Propagate the validate option to GeoParquetFileOpener (which yields the final decoded batches as the data source output), and use it to optionally validate the decoded batches for WKB validity.
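As a rough illustration of what this per-batch check does, here is a minimal pure-Python stand-in (the real implementation calls Rust's wkb::reader::read_wkb; this sketch only inspects the WKB header, and validate_wkb/validate_column are made-up helper names for illustration):

```python
import struct

# Base WKB geometry types: Point .. GeometryCollection.
GEOMETRY_TYPES = {1, 2, 3, 4, 5, 6, 7}

def validate_wkb(buf: bytes) -> None:
    """Check the WKB header: byte-order flag plus a known geometry type.

    A simplified stand-in for a full WKB parser; a real check would also
    walk the coordinate payload.
    """
    if len(buf) < 5:
        raise ValueError("WKB too short")
    byte_order = buf[0]
    if byte_order == 0:
        (geom_type,) = struct.unpack_from(">I", buf, 1)  # big-endian
    elif byte_order == 1:
        (geom_type,) = struct.unpack_from("<I", buf, 1)  # little-endian
    else:
        raise ValueError(f"invalid byte-order flag {byte_order}")
    # Tolerate ISO Z/M variants (1001, 2001, ... offsets).
    if geom_type % 1000 not in GEOMETRY_TYPES:
        raise ValueError(f"unknown geometry type {geom_type}")

def validate_column(column_name, rows):
    """Mirror the batch loop: skip nulls, fail fast with column/row context."""
    for row_index, maybe_wkb in enumerate(rows):
        if maybe_wkb is None:
            continue
        try:
            validate_wkb(maybe_wkb)
        except ValueError as e:
            raise ValueError(
                f"WKB validation failed for column '{column_name}' "
                f"at row {row_index}: {e}"
            )
```

On a failing row the error carries the column name and row index, so a corrupted file is reported at scan time instead of surfacing later as a cryptic operator error.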

pub object_store: Arc<dyn ObjectStore>,
pub metadata_size_hint: Option<usize>,
- pub predicate: Arc<dyn PhysicalExpr>,
+ pub predicate: Option<Arc<dyn PhysicalExpr>>,
Contributor Author

refactor 1: previously this field was required, since GeoParquetFileOpener is only used when there is a geo column and a valid spatial predicate; otherwise it falls back to the inner parquet opener.

Now it also has to be used when validation is enabled even without a spatial predicate, so the field is made optional.

pub enable_pruning: bool,
pub metrics: GeoParquetFileOpenerMetrics,
pub overrides: Option<HashMap<String, GeoParquetColumnMetadata>>,
pub options: TableGeoParquetOptions,
Contributor Author

refactor 2: previously overrides lived inside TableGeoParquetOptions, and the new validate flag also lives inside that option, so I think keeping only the option here makes it simpler.

metadata_size_hint: Option<usize>,
predicate: Option<Arc<dyn PhysicalExpr>>,
overrides: Option<HashMap<String, GeoParquetColumnMetadata>>,
options: TableGeoParquetOptions,
Contributor Author

refactor 3: similar to another refactor in file_opener.rs

Member

@paleolimbot paleolimbot left a comment

A few optional things to consider...thank you!

Just curious...have you run into files where this is a problem? I imagine if it does happen the existing error would be rather cryptic (although I believe we're good about validating WKB payload in every operator).

Comment on lines 318 to 323
def test_read_parquet_validate_wkb_single_valid_row(con, tmp_path):
    valid_wkb = bytes.fromhex("0101000000000000000000F03F0000000000000040")

    table = pa.table({"id": [1], "geom": [valid_wkb]})
    path = tmp_path / "single_valid_wkb.parquet"
    pq.write_table(table, path)
Member

Optional, but there is an io/test_parquet.py that is possibly a better fit for these tests (and the geometry_columns tests), since there are now a lot of them!

Contributor Author

updated in 99d52a2

Comment on lines 233 to 247
DataType::Binary => {
    let array = as_binary_array(array)?;
    for (row_index, maybe_wkb) in array.iter().enumerate() {
        if let Some(wkb_bytes) = maybe_wkb {
            if let Err(e) = wkb::reader::read_wkb(wkb_bytes) {
                return exec_err!(
                    "WKB validation failed for column '{}' at row {}: {}",
                    column_name,
                    row_index,
                    e
                );
            }
        }
    }
}
Member

While we're here, it would be good to handle the LargeBinary case (this can happen for reasons outside a GeoParquet producer's control due to various auto-store/load of schemas)
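One way to cover both layouts without duplicating the loop is to share a single value-level routine: Binary and LargeBinary differ only in offset width (i32 vs i64). A hedged pure-Python sketch of the idea, assuming a flattened (offsets, data, validity) representation of an arrow binary array (in arrow-rs this corresponds roughly to writing the helper generically over GenericBinaryArray<O: OffsetSizeTrait>, though the merged fix may differ):

```python
def iter_binary_values(offsets, data, validity=None):
    """Yield each value (or None) from an arrow-style binary layout.

    Python ints subsume both 32- and 64-bit offsets, so the same routine
    serves Binary and LargeBinary; the validation loop only needs to
    iterate values and never touches the offset width directly.
    """
    for i, (start, end) in enumerate(zip(offsets, offsets[1:])):
        if validity is not None and not validity[i]:
            yield None  # null slot: validation skips it
        else:
            yield data[start:end]
```

With the iteration factored out like this, the WKB check itself is written once and applied to whichever physical layout the file happens to use.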

Contributor Author

fixed in 99d52a2

2010YOUY01 and others added 2 commits February 8, 2026 11:53
Co-authored-by: Dewey Dunnington <dewey@dunnington.ca>
@2010YOUY01
Contributor Author

> A few optional things to consider...thank you!
>
> Just curious...have you run into files where this is a problem? I imagine if it does happen the existing error would be rather cryptic (although I believe we're good about validating WKB payload in every operator).

Thank you for the review! I have not encountered a corrupted file so far, but my concern is the same: if this error is caught inside operators, it might get propagated several times and produce confusing error messages.

Member

@paleolimbot paleolimbot left a comment

Thank you!

@paleolimbot paleolimbot merged commit bc98ac4 into apache:main Feb 9, 2026
16 checks passed
